Math 419/592 Project Rules and Ideas, Version 1
Winter 2008
Project Rules
Project proposals are required from everyone. A proposal
should be at least 1 page long, and will usually include roughly 3
references. The proposal should also specify a journal that the
results of this project might be submitted to, plus a secondary
(backup) journal. If you need to submit an Inter-Library Loan
request, remember to do it early.
Project reports should be in the form of a submission to a specific
journal that you think would be the most appropriate.
Prof. Ross will let you know if you should actually submit
it to the journal; depending on the amount of help you get, a
co-authorship may be appropriate. A typical organization for a paper is:
- Title, authors, affiliation
- Abstract and keywords
- Introduction
- Literature Review
- Model
- Methods
- Results
- Conclusions
- Appendix: computer code, other lengthy material
though this may be changed somewhat. Try to emulate (but do
not copy!) well-written papers that you have read.
Here is the timeline for the project:
10% Feb. 9: project proposal (>=1 pages)
40% Feb. 23: project submission to Prof. Ross
--% Mar. 6: referee reports returned
40% Mar. 13: project revisions submitted to Prof. Ross
10% Mar. 15: presentations (5 minutes per project)
If you are happy with the grade you receive for the first
submission and elect to do nothing, then you need not turn in
a revision. In that case, you will receive the same grade for
the second 40% as you did for the first 40%.
If you miss the first submission deadline, your maximum score
on the overall project will be (10+0+40+10)=60; this is, how
shall I put this, not recommended. Similarly, turning in an
incomplete paper for that first submission will drag down your
score, so I recommend to you that you make good use of my
office hours to get guidance on your projects.
Also, please make a copy of most of the sources you referenced, so I can
read what they have to say.
Project Ideas
This is a preliminary list of project ideas. More ideas are sure to
follow. If you have an idea you don't see on this list, please let me
know. The lists are divided into mid-term and final lists, mainly
because some topics won't be covered until the second half of the
semester.
It is important to realize: most of these project ideas will need
guidance from Prof. Ross. If you try to do them by just reading these
tiny descriptions, things will go very badly. You should plan on
meeting with Prof. Ross to discuss possible projects, a week or so
before the proposal deadline.
One of the student course evaluations from last year said "Give us
more time on the projects." Well, the semester is 15 weeks long or
so, broken into halves, and you're free to start the project(s) during
week 1. I don't see how I can give you more time. Use the time you
have wisely.
Note: DTMC = Discrete-Time Markov Chain, CTMC = Continuous-Time Markov
Chain.
Topics that I would particularly like to see done are marked with a
star (*).
Mid-Term Project
- farecast.com
-
farecast.com tries to predict how airline fares will change in
time: should you buy now, or wait for the price to go down?
It's clearly a stochastic process. I haven't thought much more
about it. It may be more appropriate for the second half of
the semester. There are other, competing sites as well.
- Pavement Maintenance as a Markov Decision Process
-
Each year, you can do minor patching on a piece of road, or major
patching, or repave the whole thing. It will degrade according to a
random process. Look at this as a Markov Decision Process, prompted
by papers by Pablo Durango.
- Spam Filtering
-
Read various articles about spam filtering techniques. Maybe
implement a filter for yourself.
- 2-dimensional Poisson Processes, and others
-
It's fairly easy to simulate a 2-dimensional Poisson process
in Excel, SciLab, etc. It's not so obvious how to simulate other
2-d processes that are more or less bursty. Read up on models
that are more or less bursty than Poisson, and figure out how
to simulate them in Excel or Scilab, hopefully in an efficient
manner.
- Hyperexponential Helper and Coxian Companion
-
Create two spreadsheets or Matlab programs that will help users design
Hyperexponential or Coxian distributions that have specified values
for their mean, variance, and 3rd moment. See Prof. Ross for ideas on
user interface, etc.
- Approximations for Ej/Ek overtaking probabilities
-
What is the probability that someone who arrives after you will leave
before you, in a multi-server system? We can calculate it exactly
using a DTMC for Erlang arrivals and service durations. But is there
a faster way to approximate it? Do some of the detailed calculations
using pre-written Matlab code, do graphs, and fit some curves.
Problem size: as small as 10x10, possibly up to 1000x1000 or so.
- Backgammon Simulation
-
A simulation for Backgammon that Prof. Ross did some years ago had
some odd results. Re-run the simulation and see if the random number
generator is causing the problems. Little or no knowledge of
Backgammon required.
- Stochastic Orderings Illustrated
-
Read about Stochastic Orderings and various distribution classes (New
Better than Used in Expectation, etc.). Find ways to diagram the
relationships between classes and orderings. Do some graphs to
illustrate various orderings and classes. Prof. Ross can provide some
suggestions.
- Search Engine Rankings
-
Google (and probably other search engines) use something like a DTMC
to rank pages. Read up on how they do this, and maybe do a miniature version
for yourself. This might be related to the "build a better recommender"
class that our CompSci department is running right now (Winter 2007).
- * Stochastics and Weather Models
-
What does it mean when a weather forecast says "40 percent chance of rain"
or "4-5 inches of snow tonight"? What is a typical correlation between
two nearby regions in terms of rainfall? Read meteorology textbooks and
journal articles to get a comprehensive picture of how stochastics
and weather models interact.
Here's an idea to focus your attention: suppose you work for a company
whose operations are affected by the weather: a ski lodge, an airline,
a school district or university that might shut due to snow, an
amusement park, a city public-works (snowplow) department, etc. What
would you want to know about weather forecasting to do your job well?
- Linear Algebra/DTMC lecture
-
Prepare and give a 50-minute lecture intended for a linear algebra
class (like Math 205) on the wonders of DTMCs. Include: matrix
multiplication as transient calculations, linear-systems for
steady-state probabilities, eigenvalues to show that the transition
matrix is non-invertible, and maybe first-passage probabilities.
Use PowerPoint or LaTeX slides.
- DTMCs and Inventory
-
Learn some basic DTMC models for inventory. Either do an in-depth
literature review, or a partial literature review plus some
computations.
- Other literature review
-
Do a literature review for some field that you're interested in, with
at least some overlap with stochastics.
- DTMC numerical solvers
-
There are a number of ways to find the steady-state solution for a
DTMC. Program one or two of the moderately advanced ones, such as
Gauss-Seidel or SOR. Hopefully in Matlab, but maybe in Excel.
- * Quasi-Random Numbers
-
When we use functions like =RAND() on Excel, it generates what are
called pseudo-random numbers. Quasi-random numbers are different: they
are deliberately spaced (and correlated) so that they are more uniform.
Figure out how to generate quasi-random values in Excel, OpenOffice Calc,
Matlab, or Scilab. Other questions to ask: what does a Poisson process
generated with quasi-random numbers look like/how does it behave differently?
If I use a quasi-random Poisson process as the arrival stream for a
queue (possibly with quasi-random service as well), how does the queue
behave?
- * Geom/Geom/1 queue waiting times
-
In the simplest discrete-time queueing system, the time that someone
spends waiting has a Geometric distribution. The time that the next
person spends waiting is also Geometric, and the two are dependent (if
one person waits a long time, the person behind them will probably
wait a long time too). Set up a DTMC model and solve for its
transient probabilities to compute the correlation coefficient, etc.
Problem size: maybe a few hundred states? Probably better for Matlab
than Excel.
- Board games as DTMCs
-
Monopoly (the board game) has been analyzed as a DTMC. Do the same
for some other board game(s).
Problem size: Candyland, Snakes-and-Ladders, and Monopoly would each be in the
50x50 to 100x100 range, I think.
Also, see the extra comments at the bottom of this document.
- DNA or Protein as a DTMC
-
Note: this project was done last year in a less-than-formal way. If
you do it this year, you should expect to do formal statistical
tests.
Look for some papers that discuss whether DNA sequences or protein
sequences follow DTMCs. Take some sequence data (freely available on
the internet) and analyze it to see if it's a DTMC. For extreme extra
credit, generate DNA or protein sequences by a DTMC, construct the
molecules, and see if they cure any major diseases.
Another possibility is to get two separate DTMC data sets: one from
genes, and the other from regions of junk DNA. Then, take a new
sequence of DNA (gene or junk) and compute the likelihood it came from
one DTMC or the other. This would require a fair amount of coding.
Problem size: 4x4, 16x16, 64x64 at least--maybe also 256x256
- Music as a Markov Chain
-
Again, this suggestion is left over from last year. I highly
discourage it, unless you plan on doing a very formal analysis. For
example, formal statistical tests to see if the music data does form a
DTMC might be acceptable.
Problem size: 8x8 (one diatonic octave, one-note history), 13x13 (one
chromatic octave, one-note history), approximately 64x64 (one diatonic octave, two-note history), etc.
- Babies learn language by DTMC?
-
Again, this project is discouraged due to the lack of formality.
A recent paper proposed that babies find the divisions between spoken
words by using a DTMC sort of process. Read that paper and any
related papers. Do a small experiment using some set of language
data.
Problem size: in the range of 100x100 to 1000x1000
- Take average or maximum of scores?
-
In the Olympics, some sports are scored using the average of two
performances, while others use the maximum of two performances.
How would either scoring system affect your strategy on the first
performance, and then on the second performance? Try to come up
with a mathematical model that includes the degree to which you
"go for it" on each performance, plus some randomness in the results
of your own efforts.
Similarly, some on-line homework systems allow students to re-take
the homework (using newly generated random numbers) over and over again.
The professor has the option of using their best score, last score,
or average score. What are the implications for student tactics,
and what should the professor choose?
Final Project
- Analyze the show "1 vs. 100"
-
The game show "1 vs. 100" has some interesting decisions to
be made, about whether you should quit now and go home with your
accumulated winnings, or keep going and hope for more winnings.
Similar problems are present in Deal or No Deal, and Who Wants
to be a Millionaire (how's that for dated!), but 1 vs. 100 is
more complicated.
- Multi-Armed Bandit Problem
-
Suppose you are seated at a collection of slot
machines. You don't know the probability of payout
on each one (they may all be different). You'd
like to find out which one is the best, and only
play that one, but to do that you have to play all
of them a little bit to get an idea of how they
pay out. What should you do with your next coin?
Should you play it on the best one you've found so
far, or should you try a different one to increase
your knowledge of its payout probability?
There may be an analogy here to how PBS TV stations
do fundraising drives. They play a bunch of special
shows, then see which ones did well, then play those
some more.
- Mt/D/c program
-
Mt means a time-varying Poisson arrival process; D means deterministic
service times; "c" means a user-specified number of servers. Do the
transient calculations by creating and multiplying transition matrices
in a DTMC, using VBA and a spreadsheet interface. Or, use Matlab.
Technically this is just a DTMC rather
than a CTMC, but it's still okay for the final project.
- M/D/c approximation
-
We have a proposed approximation for the number of servers that an M/D/c
system would need in order to provide good service. Evaluate some
actual M/D/c systems to see how many servers they actually need, and
compare to the approximation.
Technically this is just a DTMC rather
than a CTMC, but it's still okay for the final project.
- Geom/Geom queues
-
Geom/Geom is the discrete equivalent of an M/M queue. Find and summarize
some seminal papers, and implement algorithms in Excel, Matlab, or other
package. Numerically compare to M/M queues.
Technically this is just a DTMC rather
than a CTMC, but it's still okay for the final project. Or, you could
do it as a midterm project.
- M/E2/c program
-
Write a program (VBA or Matlab or ???) to evaluate the performance for
an M/E2/c queueing system. This is a two-dimensional CTMC. Once it
is written, it will be fairly easy to extend it to H2 or C2 service.
- Mt/Mt/c/K program
-
Write some extensions to an existing VBA/spreadsheet program that
helps analyze queues whose arrival and service rates change
hour-by-hour. For example, one extension is to allow arrival rates
to vary according to the number of people in the system (called a
machine-repair model). Write a report describing the theory of the
program.
- Dam Queues
-
No, I'm not frustrated with queues. I mean, consider the process of
water inflow/outflow at a dam as a queue. Find seminal papers, etc.
Perhaps consider networks of dams, and also the economics of selling
electricity from hydro-power, which can be related to Dynamic
Programming as well.
- * Counter-Intuitive Behavior in Queues
-
Usually, increasing the arrival rate, or the mean service duration,
makes a queue perform worse. Also, increasing the variance usually
makes things worse. However, there are some cases where this is
not the case. Do a literature search for such examples, and try to
find unifying themes. J.B. Atkinson, 2000 is a good paper.
Gross and Harris, page 180, mentions a few results in queueing networks.
- * Handicapped Parking Spaces
-
Research guidelines for the number of handicapped spaces in a parking
lot. What is the reasoning behind the guidelines? Does it compare to
what Linda Green discovered about the reasoning behind bed occupancy
targets for hospitals? How about guidelines for number of handicapped
stalls in restrooms?
- * Markovian Arrival Process library
-
Prof. Ross has written a Matlab library that deals with continuous
phase-type distributions (which are based on CTMCs). This can be
extended to work with Markovian Arrival Processes, which allow
some dependence from one inter-arrival to the next.
- * Discrete Phase-Type library
-
Prof. Ross has written a Matlab library that deals with continuous
phase-type distributions (which are based on CTMCs). Translate it to
work with discrete phase-type distributions (which are based on
DTMCs). Also see "Fitting discrete distributions on the first two moments"
by Adan, Van Eenige, and Resing.
- * CTMCs with negative rates
-
Sometimes we can bend the rules with CTMCs and give them negative
rates. It's unclear when this will work and when it won't. It may
also be related to matrix-exponential distributions. Do a
literature search, and a few experiments. I can provide some places to start.
- * DTMCs with negative probabilities
-
See if we can translate the above project (CTMCs with negative rates)
into DTMCs.
- * Stochastic models of sequential lead times
-
If you order from your supplier today, and then order again tomorrow,
your shipments will probably arrive in the same order. That is,
the later one won't overtake the earlier one. This is called "sequential
lead times", and is a common assumption in some inventory models.
Do a literature search on the details of these kinds of models. Also,
do a simulation of some sequential lead times under different distributions.
- * Airport security waiting times
-
Download
http://waittime.tsa.dhs.gov/waittimes-xml.zip
and play around with the data. Do some interesting graphs. Look for
unexpected things. See how often the file gets updated.
- * Stochastic Models of Alternative Energy
-
When planning how to meet electric power demand over the next 24 or 48
hours, we often assume that generators will perform as we tell them to.
This works pretty well for nuclear, coal, hydro, and gas, but new
alternative energy sources (mainly wind and solar) are more random.
Do a literature survey on stochastic/statistical models for their
generation abilities. Among other things, try to find out:
- how much correlation is there from one windmill to the next in a wind
farm?
- How much correlation is there between a solar farm and a wind farm?
- How much correlation is there between either and energy demand?
(e.g. hot day = lots of sunshine = lots of air conditioning demand)
- Brownian Motion and Inventory
-
Does anybody ever use Brownian motion or related processes
(Ornstein-Uhlenbeck, etc.) to model supply chain inventories? Do
a literature survey.
- Financial Engineering/Insurance
-
Look into the basic problem of Insurance: calculating the ruin
probability. Find and summarize some seminal papers,
and do a numerical investigation of a typical problem.
- Financial Engineering/Black-Scholes Formula lecture
-
Read the textbook section on Black-Scholes, and look up some of the
original journal papers. Prepare and deliver a 50-minute lecture on
the topic with PowerPoint or LaTeX slides.
- Renewal Theory for an industrial problem
-
Prof. Storer shared a renewal-theory problem with me, about barrels
that can get damaged and pulled from service, or expire after 2 years
no matter what. Compute the renewal function (not easy!), and draw
some graphs.
Also see
W. Stadje:
Note on the renewal process for truncated exponential variables. Statistics and Probability Letters 6 , 61 - 66 (1987).
- Renewal Theory: Inspection Paradox website
-
Prof. Ross has an idea for a web site that will help illustrate the
inspection paradox. Help him implement it, and write a report. We
might be able to get away with plain HTML, but it may require Java or Flash.
- Is it a Renewal Process?
-
Analyze some physical process (asteroid strikes, earthquakes,
radiation, supernova, gamma-ray burst, electric power blackouts, etc.)
to see if a Renewal Process (or even Poisson Process) is a good model.
This project may have a lower degree of difficulty. To increase the
difficulty, look up some references on formal statistical tests for
renewal processes; ask to see Prof. Ross's copy of the Bhat textbook
for more info.
- * Nonlinear Renewal Theory
-
Read books and articles about Nonlinear Renewal Theory, translate the
theoretical theorem statements to something understandable, and come up with
a few numerical examples.
- * Two-dimensional Renewal Theory
-
Read books and articles about two-dimensional Renewal Theory, translate the
theoretical theorem statements to something understandable, and come up with
a few numerical examples. Give examples of real-life situations where
it's useful. What is the equivalent of the Central Limit Theorem for
Renewal Processes (CLTRP)? What is the equivalent of the Inspection Paradox?
Starting references: "Renewal Theory in Two Dimensions: Basic Results" by Hunter,
"Renewal Theory in Two Dimensions: Asymptotic Results" by Hunter
(Basic Results also has some interesting links to bivariate exponential distributions), and
"What is a multi-parameter renewal process?" by Ivanoff, B. Gail and Merzbach, Ely
- Financial Engineering: Is it Brownian Motion?
-
Review articles that claim some physical or financial process follows
some kind of Brownian Motion. Find a data set and do your own
analysis. Try the Dow Jones average, individual stocks, or even
weather data.
This project may have a lower degree of difficulty, unless you research
formal statistical methods for detecting Brownian motion.
- Parking Lot Strategies, single lot
-
A recent paper analyzed some strategies for finding a good parking
space in a single lot. Read it (get a copy from me) and do some extensions.
- Parking Lot Strategies, multiple lots
-
Which parking lot should I try first when I drive to campus? If that one
fails, which one should I try afterwards? Things to contemplate:
my destination on campus,
the random time at which each lot fills up,
what time it is now,
how long it takes to determine if a lot is full,
time to drive from one lot to another,
etc. My goal might be to minimize walking distance, or it might be to
maximize my probability of making it to class at a certain time.
This is something like a Dynamic Program, but maybe not exactly.
- * Cell-Phone Traffic Models
-
In queueing theory, it's important to know how long people stay on the telephone. In cell-phone networks, things are complicated by people moving between
cells. Start by reading
http://en.wikipedia.org/wiki/Cellular_traffic
and maybe do a little simulation.
- Disneyland FastPass
-
Research how FastPass works for roller-coaster lines. Find any
academic articles on it. If none exist, do some stochastic modeling
of the situation.
- Yield Management
-
Look up some seminal papers on Yield Management, and describe the
state of the art. Tackle a small problem yourself. Maybe work with
the Lehigh undergrad admissions office?
- Financial Engineering: Stochastic Calculus lecture
-
Prepare and give a 50-minute lecture introducing the main ideas of
Stochastic Calculus with PowerPoint or LaTeX slides.
- Case-study from DVD
-
I have two or three DVDs that have hourlong cases. Try to do one
of them.
- Renewal Theory lecture
-
Prepare and give a 50-minute lecture on renewal theory. Include the
nice intuitive results and the not-so-intuitive results related to
the Inspection Paradox. Also include the offset terms for the renewal
function (and variance function).
- Other lectures
-
Prepare and give a 50-minute lecture on some other part of the book:
CTMCs, queueing theory, reliability theory, etc.
Further descriptions
- Music
-
There are two prominent music file formats. The first is MIDI, but it is
kind of difficult to work with. Another is called "ABC", and you
can find a lot more about it
here .
It might be easier to use, as it is all text-based.
- Board-game projects
-
For those of you who are doing board-game projects, a big question is
what the goal of your analysis should be. Let's consider a pure chance
game with no strategy, like Candyland or Snakes-and-Ladders.
A steady-state DTMC analysis is not possible, because it is a terminating
chain. A DTMC analysis would reveal the mean number of turns until
finishing from any board position. With a little more work, you can
get the whole distribution (and thus the variance). Some questions
then are:
- What is the probability that the first player wins, if there are 2 players?
- If there are more players, how do those probabilities change?
- Do a Normal approximation to winning probability using the mean
and variance of turns-to-finish from square 1.
- Plot the mean turns-till-finish, plus and minus 1 or 2 sigma, for each
square on the board.
- Plot a representation of your DTMC matrix (for example, using spy()
in Matlab)
- DNA or Proteins
-
I don't know if you have a preference for working with human data or with some other organism. My wife is most familiar with the yeast genome database, so that is what I will suggest to start with. The yeast we are talking about is the one that is used in making beer or in making bread. Go to
http://db.yeastgenome.org/cgi-bin/SGD/seqTools
and type "tif4631" (without the quotes) in the "Enter a Name" box. This is a gene that encodes a protein named EIF4G that is involved in "translation", the process of turning RNA into proteins. Hit "submit", and you get the results page.
In the first row of the table, "DNA of Region", click on "NoHeader" to get the AGTC stuff without other formatting information.
Also, in the 3rd row of the table, "Protein Translation of ORF", click on "NoHeader" to get the amino acid sequence that this gene codes for. There are 20 amino acids, each represented by a letter like M, T, D, E, etc.
By the way, ORF stands for Open Reading Frame. It basically means some
sequence of DNA that starts with a particular 3-letter sequence that means
"start reading here".
I think it would be a good idea to focus on data from defined genes, but if you want more data than that, you could always start with Chromosome One, Position One by starting at that same page
http://db.yeastgenome.org/cgi-bin/SGD/seqTools
and in Box 2, "Pick a Chromosome", pick chromosome 1. Do the same "NoHeader" thing, and you'll get a lot more data to play with. I'm not sure what the * means in the protein data; probably "stop" , but we could treat it just like any other symbol.
If you are more interested in working with human data, try starting at
http://www.ncbi.nlm.nih.gov/
but from that point, you know as much as I do.
- Babies Learning Language
-
Start by reading this web page:
The Why Files
.
They don't actually say "Markov chain", but I think as you read it you will get the idea.
One of the prominent people in the field seems to be
Prof. Saffran
.
Perhaps reading some of the papers she lists will be a good place to start. I'm pretty sure Lehigh has on-line access to the journal Science, at least.
Statistical Learning by 8-month-old Infants,
Jenny Saffran et al,
Science, December, 1996, v. 274, pp. 1926-8.
Word Segmentation: The Role of Distributional Clues,
Jenny Saffran et al,
Journal of Memory and Language, Aug. 1996, pp. 606-621.
- Counter-Intuitive Behavior in Queues
-
Places to look (I have copies of most of these):
- Wolff's Stochastic Modeling and the Theory of Queues has a few examples
- Cooper's Introduction to Queueing Theory, 3rd ed. pgs 298-301.
- "Resource Sharing for Efficiency in Traffic Systems" by D.R. Smith and Ward Whitt, 1981, BSTJ
- Davis, Massey, and Whitt 1995
- Shortle et al., various parametrizations of Paretos
- various Whitt papers?
- look through QUESTA abstracts?
Deprecated Project Ideas
The following project ideas come from previous years, and are no
longer encouraged as projects. But I didn't want to delete them from
the list entirely, so here they are.
- Last-In First-Out waiting time distribution
-
The waiting time distribution for an M/M/1 queue with First-In
First-Out (FIFO) service is easy. Last-In First-Out (LIFO) is more
complicated, and I haven't seen nice graphs of it. Same for
Service-In-Random-Order (SIRO). Either take the existing formulas and
find good ways to evaluate them numerically, or do transient CTMC ODEs
to find the numbers. Find the variance, and make some
nice graphs. Also plot CDFs to show if they are stochastically ordered.
Also, plot sample paths under FIFO, LIFO, and SIRO for a set of
simulated data.
- A CTMC from Rental Truck Networks
-
Prof. Ross came across a two-dimensional CTMC when pondering how networks
of rental truck depots work. He can explain it to you; you try to find
a form for the steady-state distribution. Also applicable to networks of
modem banks and networks of spare-part supply depots.
- A CTMC on Highway Accidents
-
Read two papers by Prof. Melike Gursoy of Rutgers; program a three-
dimensional CTMC (probably in Matlab) and find steady-state solutions;
see if they correspond to a guess.
- Modeling the spread of rumors
-
Read the article "Rumours and Errours", Brian Hayes, in American Scientist
volume 93 number 3 (May-June 2005) 207-211, and do some
experiments or extensions related to it. A review of the article
appears in the College Math Journal, Vol 37 number 2, March 2006,
page 155-156. Modeling the spread of rumors is very much like modeling
the spread of diseases, so this is a good project for bio-oriented
people.